[New Workflow] AMR-search for neisseria gonorrhoeae samples #743

Draft
wants to merge 14 commits into
base: main
from

Conversation

fraser-combe

@fraser-combe fraser-combe commented Feb 3, 2025

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR creates a standalone workflow for PathogenWatch AMR-search in order to use its antimicrobial resistance (AMR) profiling steps. The AMR-search output is processed by an integrated Python script, parse_amr_json, which extracts the relevant data into a CSV file and a PNG summary table for visualization. This workflow will likely be integrated into TheiaProk at a later date.

Documentation for this new workflow has been created.

⚡ Impacted Workflows/Tasks

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

Implementation of wf_amr_search.wdl along with task_amr_search.wdl
This includes the building of a new docker container with PathogenWatch AMR-search and PAARSNP installed.

⚙️ Algorithm

PAARSNP is run on a microbial FASTA file and generates a JSON file containing AMR profiling information. This JSON is then passed to a Python script, parse_amr_json.py, which is housed within the docker container us-docker.pkg.dev/general-theiagen/theiagen/amrsearch:0.2.0. The script parses the JSON and creates a CSV and a PNG that resemble the output of Pathogenwatch's AMR profile.
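For readers unfamiliar with the JSON-to-CSV step, here is a minimal sketch of the idea. It is not the contents of parse_amr_json.py: the JSON layout (`resistanceProfile`, `agent`, `state`, `determinants`), the column names, and the helper functions are all assumptions made for illustration, and the real PAARSNP output may be structured differently.

```python
import csv
import io
import json

# Hypothetical PAARSNP-style payload. The key names are assumptions for
# illustration only and may not match the real output.
EXAMPLE_JSON = """
{
  "resistanceProfile": [
    {"agent": {"name": "Tetracycline"}, "state": "RESISTANT",
     "determinants": {"acquired": ["tetM"], "variants": ["rpsj_V57M"]}},
    {"agent": {"name": "Ciprofloxacin"}, "state": "NOT_FOUND",
     "determinants": {"acquired": [], "variants": []}}
  ]
}
"""

FIELDNAMES = ["antimicrobial", "state", "genes", "mutations"]


def amr_json_to_rows(payload):
    """Flatten the (assumed) resistance profile into one dict per antimicrobial."""
    rows = []
    for entry in payload.get("resistanceProfile", []):
        determinants = entry.get("determinants", {})
        rows.append({
            "antimicrobial": entry["agent"]["name"],
            "state": entry.get("state", ""),
            "genes": "; ".join(determinants.get("acquired", [])),
            "mutations": "; ".join(determinants.get("variants", [])),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the flattened rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = amr_json_to_rows(json.loads(EXAMPLE_JSON))
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the flattening step separate from the serialization step means the same rows could feed both the CSV and a rendered table image, which is roughly the split the description above implies.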

➡️ Inputs

  • input_fasta -> a microbial FASTA file
  • samplename -> Name which the user wants prefixed to output files
  • amr_search_database -> The NCBI taxon code that is used by PAARSNP to pull the correct .toml file from the amr-libraries stored within the docker container.

⬅️ Outputs

  • amr_search_results -> JSON output of AMR profiling information
  • amr_results_csv -> CSV format of AMR profiling information
  • amr_results_png -> PNG format of AMR profiling information, resembling the Pathogenwatch PDF output

🧪 Testing

  • Verified that AMR results are correctly parsed and formatted in JSON, CSV, and PNG outputs locally and in Terra
  • Verified that AMR results are identical to, or differ minimally from, Pathogenwatch outputs. The minimal differences arise because Pathogenwatch uses AMRsearch libraries v0.0.17, whereas this workflow uses v0.0.20.

Initial Terra Test
Test Containing All Species
E. coli was not included in this test because no publicly available examples exist in Pathogenwatch; PAARSNP/AMRSearch was not run prior to a certain date.
GCA_011383385_typhi: The newer 0.0.20 database additionally predicts sul2.
GCA_042331435_GC: The newer 0.0.20 database additionally predicts tetM and rpsj_V57M. The inferred tetracycline phenotype changed from intermediate to resistant.

Suggested Scenarios for Reviewer to Test

wf_amr_search.wdl provides the correct outputs: PNG, JSON, and CSV.
If Pathogenwatch was used previously, compare the workflow's outputs against those existing results.
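A reviewer comparing against earlier Pathogenwatch results might script the check along these lines. This is only a sketch: the `diff_profiles` helper, the state labels, and the example values are hypothetical. The tetracycline entry mirrors the shift reported above for GCA_042331435_GC, and the ciprofloxacin entry is made-up filler.

```python
def diff_profiles(reference, candidate):
    """Return {antimicrobial: (reference_state, candidate_state)} for each mismatch."""
    changes = {}
    for agent in sorted(set(reference) | set(candidate)):
        ref_state = reference.get(agent, "MISSING")
        new_state = candidate.get(agent, "MISSING")
        if ref_state != new_state:
            changes[agent] = (ref_state, new_state)
    return changes


# Illustrative data only, not real results.
pathogenwatch = {"Tetracycline": "INTERMEDIATE", "Ciprofloxacin": "RESISTANT"}
workflow = {"Tetracycline": "RESISTANT", "Ciprofloxacin": "RESISTANT"}

changes = diff_profiles(pathogenwatch, workflow)
for agent, (old, new) in changes.items():
    print(f"{agent}: {old} -> {new}")
```

In practice the two dictionaries would be loaded from the existing Pathogenwatch export and from the workflow's amr_results_csv output.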

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the "Last Known Changes" field for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables to be the tag for the next upcoming release. If you do not know the tag, please put "vX.X.X"

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@AndrewLangvt AndrewLangvt left a comment

Some initial changes here @awh082834. Mostly minor documentation/namespace polishing. Well done with the overarching schema. We'll see what the UAT brings about for feedback.

workflow amr_search_workflow {
input {
File input_fasta
String amr_search_database

Please consider changing this as the user doesn't need to pass in a DB, only the taxon code that then references the correct DB to use. Taxon/taxon_of_interest/taxon_code might make more sense. Some of this will depend on how we do the mapping when we plug into TheiaProk, but just dropping this as a note for future work when we implement that integration upstream.
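One way the upstream mapping could look once this is wired into TheiaProk is sketched below. Everything here is hypothetical: the `TAXON_CODES` table and `resolve_taxon_code` helper are not part of this PR, and the table lists only two organisms for illustration (485 and 90370 are the NCBI taxonomy IDs for *Neisseria gonorrhoeae* and *Salmonella* Typhi); the supported-species list in the workflow documentation remains the authority.

```python
# Hypothetical lookup table; extend it from the workflow's supported-species list.
TAXON_CODES = {
    "neisseria gonorrhoeae": "485",
    "salmonella enterica serovar typhi": "90370",
}


def resolve_taxon_code(organism):
    """Map an organism name to the taxon code passed to the AMR search task."""
    key = " ".join(organism.split()).lower()  # normalize whitespace and case
    try:
        return TAXON_CODES[key]
    except KeyError:
        raise ValueError(f"No AMR search library mapped for organism: {organism!r}")


print(resolve_taxon_code("Neisseria  gonorrhoeae"))
```

Failing loudly on an unmapped organism avoids silently running the wrong library.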


## AMR_Search_PHB

The AMR_Search workflow is a standalone version of Pathogenwatchs AMR profiling functionality utilizing `AMRsearch` tool from Pathogenwatch.

Pathogenwatch's

| amr_search | **disk_size** | Integer | Amount of storage (in GB) to allocate to the task |50| Optional |
| amr_search | **docker** | String | The docker container to use for the task |us-docker.pkg.dev/general-theiagen/theiagen/amrsearch:0.2.0| Optional |
| amr_search | **memory** | Integer | Amount of memory/RAM (in GB) to allocate to the task |8| Optional |
| amr_search_workflow | **amr_search_database** | String | NCBI taxon code of samples known taxonomy, see above supported species || Required |

move up so required inputs are listed first


This task performs *in silico* antimicrobial resistance (AMR) profiling for *Neisseria gonorrhoeae* using **AMRsearch**, the primary tool used by [Pathogenwatch](https://pathogen.watch/) to genotype and infer antimicrobial resistance (AMR) phenotypes from assembled microbial genomes.

**AMRsearch** screens against an in-house library of curated genotypes and inferred phenotypes, developed in collaboration with community experts. Resistance phenotypes are determined based on both **resistance genes** and **mutations**, and the system accounts for interactions between multiple SNPs, genes, and suppressors. Predictions follow **S/I/R classification** (*Sensitive, Intermediate, Resistant*).

This isn't Theiagen's "in-house library" so we probably want to adjust this. I am guessing this is a copy/paste from PW description of AMR_search (totally fine), but we'll want to have it reflect Theiagen, not PW. I.e. "Screens against Pathogenwatch's library of curated genotypes..."

Ah yes, this was a holdover from the original documentation that I decided to keep. In context it definitely sounds like it's our library. I'll get this changed!

| Software Documentation | [Pathogenwatch](https://cgps.gitbook.io/pathogenwatch) |
| Original Publication(s) | [PAARSNP: *rapid genotypic resistance prediction for *Neisseria gonorrhoeae*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7545138/) |

!!! techdetails "`parse_amr_json.wdl` Details"

remove

| amr_search_docker | String | Docker image used to run AMR_Search |
| amr_search_version | String | Version of AMR_Search libraries used |

## References (if applicable)

remove "(if applicable)"

}
command <<<
# Extract base name without path or extension
# Added suffix strip to handle cases of differing FASTA extensions. Was hard coded to .fasta

"Strip suffix to handle cases of differing FASTA extension."

we need not describe what was added/what previously existed, only comment the existing code for functional understanding/comprehension
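The task itself does this in shell, but the suffix-stripping logic under discussion can be illustrated in Python as follows. The list of extensions and the `fasta_base_name` helper are assumptions for the sketch, not what the task's command block contains.

```python
from pathlib import Path

# Assumed set of FASTA extensions; adjust to whatever the task should accept.
FASTA_SUFFIXES = (".fasta", ".fna", ".fas", ".fa")


def fasta_base_name(path):
    """Return the file name without its directory or FASTA extension."""
    name = Path(path).name
    # Check longer suffixes first so ".fasta" wins over ".fa"-style prefixes.
    for suffix in sorted(FASTA_SUFFIXES, key=len, reverse=True):
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    # Fall back to dropping only the final extension, if there is one.
    return Path(name).stem


print(fasta_base_name("/cromwell_root/inputs/GCA_011383385.1.fna"))
```

Stripping only known suffixes keeps dotted accession names such as `GCA_011383385.1` intact.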

# Move the output file from the input directory to the working directory
mv $(dirname ~{input_fasta})/${input_base}_paarsnp.jsn ./~{samplename}_paarsnp_results.jsn

python3 /scripts/parse_amr_json.py \

maybe worth a comment here with a link to the location of this script, for posterity's sake

Sounds good! I'll put the link to the current dev branch of the docker builds repo and will update it when it gets merged.

3 participants